A novel machine learning prediction model for metastasis in breast cancer

Abstract

Background
Breast cancer (BC) metastasis is a common cause of high mortality. Conventional prognostic criteria cannot accurately predict BC metastasis risk. Machine learning technologies can overcome this disadvantage of conventional models.

Aim
We developed a model to predict BC metastasis using the random survival forest (RSF) method.

Methods
Based on demographic data and routine clinical data, we used RSF-recursive feature elimination to identify the predictive variables and developed a model to predict metastasis using the RSF method. The area under the receiver operating characteristic curve (AUROC) and Kaplan–Meier (KM) survival analyses were used to validate the predictive effect, the C-index was used to assess discrimination, and the Brier score was used to assess the calibration of the predictive model.

Results
We developed a metastasis prediction model comprising three variables (pathological stage, aspartate aminotransferase, and neutrophil count) selected by RSF-recursive feature elimination. The model was reliable and stable when assessed by the AUROC (0.932 in the training set and 0.905 in the validation set) and KM survival analyses (p < .0001). The C-index (0.959) and Brier score (0.097) also validated the good predictive ability of this model.

Conclusions
This model relies on routine data and examination indicators in real-time clinical practice and exhibits accurate prediction performance without increasing the cost for patients. Using this model, clinicians can facilitate risk communication and provide precise and efficient individualized therapy to patients with breast cancer.

and the natural environment, the morbidity and mortality rates of BC have increased in China in recent years. 2,3 Most breast cancer-related deaths are associated with metastasis; over 90% are attributed to metastasis-related complications. 4 Metastatic BC remains incurable despite improvements in early detection and advances in treatment, because it is refractory to almost all current treatments, most of which are palliative rather than curative. 4,5 On the one hand, identifying BC metastasis risk could inform approaches to early detection and prevention through additional interventions. On the other hand, accurate prediction of metastasis is critical for precision medicine and individualized therapy, thus avoiding the need for toxic and costly therapies.
However, inappropriate screening examinations or the overuse of diagnostic tests has increased healthcare costs. 7][8] Therefore, conventional prognostic criteria cannot predict the metastasis risk accurately in patients with BC. Consequently, many patients unnecessarily receive cytotoxic chemotherapy. 8 Accurate prediction of BC metastasis risk could help to reduce the public health and social burdens of breast cancer. As gene technology has developed, researchers have integrated multiple genetic and molecular markers into newer models to predict the prognosis of breast cancer patients. ][10] However, gene detection requires cutting-edge technology, making it expensive. Furthermore, its utility has only been established in a certain patient subset 8,[11][12][13]; thus, it is not widely used.
5][16][17] PREDICT and Adjuvant! 5][16][17] Cox regression is generally employed to identify predictors but involves restrictive assumptions, such as proportionality of hazards and linearity, 18 which may introduce bias into the prognostic analysis of BC patients during long-term follow-up and hinder the identification of prognostic markers. 19,20 Therefore, a simple and accurate predictive model with high clinical applicability and generalizability is urgently needed to predict BC metastasis.
Machine learning methods can construct predictive models that evaluate numerous variables efficiently, overcoming this disadvantage of conventional models. Random survival forest (RSF), developed from random forest and survival analysis, is a machine learning method 21 that places no restrictions on the data distribution. It is therefore non-parametric and can be applied to data in which the number of variables is significantly larger than the sample size.
3][24] Further, there are no special requirements for the data type or the association between outcomes and predictive variables, and the method is not constrained by the log-linearity or proportional hazards assumptions. 21,25 We employed the RSF method based on baseline clinical parameters, including general information of patients, pathological examinations, and blood tests, to develop a new predictive model for BC metastasis occurrence.

| Patients and study design
We retrospectively investigated the medical records of BC patients from two independent institutions from January 2013 to December 2020. Patients with stage 0 to III primary BC were enrolled, and all enrolled patients received primary BC treatment. Patients with stage IV breast cancer, other synchronous malignancies, a history of other cancers, incomplete information (lacking >50% of parameters), or who were lost to follow-up were excluded. Patients from the Third Affiliated Hospital of Sun Yat-sen University were assigned to the training set to develop the model, and patients from Liuzhou Women and Children's Medical Center were assigned to the validation set to validate the model. Figure 1 presents the flowchart of the study design and patient selection.

| Statistical analyses
The baseline demographic and clinical characteristics of patients were presented as percentages or means with standard deviations. Continuous data were evaluated using the Student's t-test or the Mann-Whitney U test, and categorical data were evaluated using the chi-squared test. All statistical analyses were performed using R software (R Foundation for Statistical Computing, Vienna, Austria). All analyses were two-tailed, and differences were considered statistically significant at p < .05.

| Data preprocessing
Missing data in the training and validation sets were imputed using the "mice" package in R with predictive mean matching (PMM).
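To illustrate the idea behind predictive mean matching, the following is a minimal Python sketch, not the authors' R workflow and not the algorithm implemented in "mice". It assumes a single numeric predictor, a plain least-squares fit, and a hypothetical function name `pmm_impute` with a hypothetical donor count `k`; it also omits the parameter draws and multiple imputations that a full implementation would use.

```python
import random


def pmm_impute(x, y, k=3, seed=0):
    """Impute None entries of y from one predictor x (simplified PMM sketch)."""
    rng = random.Random(seed)
    # Observed cases are the only possible donors.
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(observed)
    mean_x = sum(xi for xi, _ in observed) / n
    mean_y = sum(yi for _, yi in observed) / n
    # Ordinary least-squares fit of y on x using observed cases only.
    sxx = sum((xi - mean_x) ** 2 for xi, _ in observed)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in observed)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    def predict(xi):
        return intercept + slope * xi

    imputed = []
    for xi, yi in zip(x, y):
        if yi is not None:
            imputed.append(yi)  # observed values are kept unchanged
            continue
        target = predict(xi)
        # The k observed cases whose predicted means are closest to the target.
        donors = sorted(observed, key=lambda d: abs(predict(d[0]) - target))[:k]
        # The imputed value is a real observed value borrowed from one donor.
        imputed.append(rng.choice(donors)[1])
    return imputed


if __name__ == "__main__":
    print(pmm_impute([1, 2, 3, 4, 5], [10, 20, None, 40, 50], k=2))
```

Because each imputed value is copied from an observed case, the filled-in data stay within the range of plausible measurements, which is the main appeal of PMM for laboratory variables.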

| Predictor selection
Based on the clinical data, random forest-recursive feature elimination (RF-RFE) (run by the "caret" package in R) was applied to select the best variable set. A positive variable importance (VIMP) value calculated by RF-RFE indicates that a variable improves predictive accuracy, while a negative value indicates an adverse effect on the prediction. 26 The RSF method was applied to develop a model to predict BC metastasis risk. To evaluate the accuracy of the model, we computed the root mean square error (RMSE) between the test and predicted values. The higher the prediction accuracy, the lower the RMSE. 27

All combinations of mtry and ntree were evaluated by a grid search employing 10-fold cross-validation, and the combination with the best concordance index (C-index) was selected as the optimized parameters. We used the C-index to assess the discrimination of the predictive model (0.5-0.7 represents weak discrimination power, 0.7-0.9 represents moderate discrimination power, and >0.9 represents strong discrimination power). 28 We used the Brier score to evaluate the calibration of the model. The Brier score is the mean squared error between the predicted probabilities and the observed outcomes. It ranges from 0 to 1, with a lower score indicating higher accuracy. 29,30 Brier scores <0.25 indicate relatively good calibration. 29,30

A receiver operating characteristic (ROC) curve and Kaplan-Meier (KM) survival analysis were applied to assess the precision of the predictive model. BC metastasis was the endpoint event.
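The two evaluation metrics described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' R code: `harrell_c_index` implements Harrell's pairwise definition and simply skips pairs with tied times, and `brier_score` is the plain binary form at a single time point, without the censoring weights that a survival-specific Brier score would use. Both function names are hypothetical.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Share of comparable pairs in which the higher-risk subject fails first."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that subject `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Skip tied times (a simplification) and pairs whose earlier
        # subject was censored, since their ordering is unknown.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied risk scores count as half-concordant
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def brier_score(predicted_probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(predicted_probs, outcomes)) / len(outcomes)


if __name__ == "__main__":
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]  # 1 = metastasis observed, 0 = censored
    print(harrell_c_index(times, events, [0.9, 0.7, 0.4, 0.1]))
    print(brier_score([0.5, 0.5], [1, 0]))
```

A constant prediction of 0.5 yields a binary Brier score of exactly 0.25, which is why scores below 0.25 are read as better than an uninformative model.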

| RESULTS
We enrolled 774 patients: 623 were included in the training set and 151 in the validation set.
RF-RFE, run using the R "caret" package, was applied to filter the most predictive set of variables, and the optimal number of variables was selected according to the RMSE. Figure 2 shows that the RMSE value was the lowest when there were three variables; thus, the three-variable set was the most predictive variable set. RSF with the "randomForestSRC" package in R was applied to develop the model. The error rate of the model gradually stabilized as the number of trees increased (Figure 4).
Between 4000 and 6000 trees, the out-of-bag error rate decreased steadily, reaching close to 0.3, and when the number of trees was fixed at 10 000, the error rate was stable. Thus, it is sufficient and reasonable to select 10 000 trees (ntree = 10 000), and the optimal parameters (ntree = 10 000, mtry = 4) were chosen for the development of the RSF predictive model. The Kaplan-Meier analysis of the training set separated the high-risk and low-risk groups, indicating that patients with higher prediction scores are more vulnerable to BC metastasis (p < .0001) (Figure 6).
The results indicated that this RSF predictive model could accurately predict metastasis in breast cancer patients.
FIGURE 2 Evaluation of the number of variables in the optimal set using the root mean square error (RMSE).

| DISCUSSION
The RF-RFE algorithm, 31 a machine learning method, was applied to automatically select the most important predictive variables for subsequent RSF model building. Variable selection is the process of selecting a subset of predictive variables for further analysis to minimize possible generalization error.
The three best variables for this model were TNM stage, AST level, and neutrophil count. These were all from blood tests and pathological examinations, and no variable based on general information of patients was selected. Previous studies have reported that TNM stage and neutrophil count are closely related to the prognosis of BC, and we used them to build a reliable model. The most important variable was TNM stage, which is applied widely in clinical practice to predict survival and prognosis and to guide clinical decision-making. 32 The enzyme AST is abundantly present in hepatocytes and in skeletal, cardiac, and smooth muscles and is released into the bloodstream in hepatitis, myocardial infarction, or myositis. High AST levels are independently associated with the prognosis of both hepatic tumor metastases and metastases from a primary hepatic source. 33,34 Previous studies have indicated that high AST levels may be associated with aggressive tumor biology or could be explained by a more aggressive tumor causing high tumor cell turnover and tissue damage. 35,36 To our knowledge, this was the first study to use AST level in predicting BC metastasis. The neutrophil count is a routine blood test in clinical practice. It was reported that a change in white blood cell (WBC) count in the peripheral blood is associated with the systemic inflammatory response. 37 Further, the tumor-related systemic inflammatory response has been proven to be an independent predictor of tumor prognosis.
38,39[42] This predictive model was established using both patient and tumor characteristics. It performed well in terms of validity and reliability, even under external validation on an independent cohort. For the training and external validation cohorts, the C-indexes were 0.959 and 0.917, respectively, showing good discrimination. The C-indexes of previously developed predictive models ranged from 0.65 to 0.71, 15,43 which suggests that this model was more accurate than previous models. Kaplan-Meier analyses were applied to assess the performance of this model, and the results indicated that our model performed well in predicting BC metastasis (p < .0001 in both the training and validation sets). Moreover, the AUROCs were 0.932 and 0.905 in the training and validation sets, respectively, which means that this model had a good predictive effect on BC metastasis. 5][46]

The training and validation sets were used for model development and model validation, respectively. Forty-one variables were included; 22 variables needed imputation in the training set and 11 in the validation set. All variables with missing data had a missing rate of less than 20%. Baseline characteristics of the training and validation sets are presented in Table 1. The RF-RFE algorithm automatically reviewed the general information of the patients, pathological examinations, and blood tests during treatment to select the most relevant features for further RSF model development. The best three variables filtered by RF-RFE comprised pathological (TNM) stage, aspartate aminotransferase (AST), and neutrophil count. The VIMP values calculated by RF-RFE are presented in Figure 3. However, no variables from general information were selected by the RF-RFE algorithm.
FIGURE 1 Flowchart of study design and patient selection.

TABLE 1 Basic information of the training and validation sets.

FIGURE 3 The variable importance (VIMP) values derived from random forest-recursive feature elimination analysis.

Based on the optimal cutoff value of the RSF-based score in the training set, patients in the validation set were divided into a high-risk group and a low-risk group. The Kaplan-Meier analyses demonstrated a significant difference in metastasis-free survival between the high-risk and low-risk groups (p < .001) (Figure 8), which validated the good predictive ability of this model.
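The risk stratification step can be illustrated with a short Python sketch: split patients at a score cutoff, then estimate a Kaplan-Meier curve for each group. This is a hedged illustration, not the authors' code. The source does not state how the optimal cutoff was derived or which test produced the p-value, so the cutoff below is a hypothetical placeholder and no significance test is shown; the function names and toy data are likewise assumptions.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        n_events = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        n_leaving = sum(1 for ti in times if ti == t)
        if n_events:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        # Events and censored subjects both leave the risk set after t.
        at_risk -= n_leaving
    return curve


def split_by_cutoff(scores, cutoff):
    """Return index lists (high_risk, low_risk) for a given score cutoff."""
    high = [i for i, s in enumerate(scores) if s > cutoff]
    low = [i for i, s in enumerate(scores) if s <= cutoff]
    return high, low


if __name__ == "__main__":
    # Toy data: follow-up time, metastasis indicator, model score.
    times = [1, 2, 3, 4, 5, 6]
    events = [1, 1, 1, 0, 0, 1]
    scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
    cutoff = 0.5  # hypothetical; would be chosen on the training set
    high, low = split_by_cutoff(scores, cutoff)
    for name, idx in (("high-risk", high), ("low-risk", low)):
        curve = kaplan_meier([times[i] for i in idx], [events[i] for i in idx])
        print(name, curve)
```

Fixing the cutoff on the training set and applying it unchanged to the validation set, as the text describes, keeps the validation comparison free of information from the validation outcomes.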

FIGURE 4 Change in the prediction error rate of the metastasis risk model of breast cancer patients with tree number.

FIGURE 5 Receiver operating characteristic curve depicting performance of the developed random survival forest predictive model.

FIGURE 6 Kaplan-Meier curves for metastasis-free survival for the training set.
Furthermore, this model had good Brier scores of 0.113 and 0.097 for the training and external validation sets, respectively, showing good calibration. This model is based on routine demographic and clinical examination data in real-time clinical practice and exhibits high prediction accuracy without increasing medical expense, which distinguishes it from existing predictive models that rely on new molecular biomarkers derived from gene or protein expression analysis. Because new molecular biomarkers are not tested routinely in clinical practice, the medical expense of identifying and employing routine laboratory parameters is lower than that of employing new molecular biomarkers. Testing for new biomarkers therefore leads to additional patient expenditure and is not covered by insurance. We do not mean to negate the possible benefits of personalized care based on novel biomarkers, but not all breast cancer patients can undergo testing for a novel biomarker and not all regions can perform such testing. Thus, we need a practically simple and economically viable model to predict BC metastasis. The model established and verified in our study incorporated comprehensively selected patient-related and tumor features to provide an easy-to-operate and individualized prediction of metastasis in BC patients without additional cost.

This model can help clinicians stratify cases into high and low risk of early-stage metastasis. Thus, BC patients at low risk of metastasis can avoid the need for toxic and costly therapies, while BC patients at high risk of metastasis can undergo more aggressive systemic therapy, such as more intense chemotherapy or aggressive targeted therapy for HER2-positive patients, and a more intense follow-up scheme. Moreover, prediction of BC metastasis risk may help manage patient and caregiver expectations, help patients decide which therapies to choose, and even improve patient compliance and patient care. Besides,
considering life expectancy and competing risks of mortality, there is a risk of overtreatment of breast cancer in older individuals. 47,48 Predicted survival benefits, risk of disease progression, toxicity of anticancer therapy, life expectancy, quality of life, and patient preferences should be considered carefully when making decisions for older BC patients. Treatment decision-making for breast cancer in older individuals should involve geriatric assessment and survival estimates. [47][48][49][50] Our model can provide the BC metastasis risk to support treatment decisions for older BC patients. Overall, our model can help clinicians provide more targeted and more accurate individualized therapy to improve the prognosis of breast cancer patients.

FIGURE 7 Receiver operating characteristic curve depicting performance of the developed random survival forest prognostic model in the validation set.

FIGURE 8 Kaplan-Meier curves for metastasis-free survival for the validation set.

There were some limitations in this study. First, all enrolled patients were of Han descent, so this study lacks validation for other races. Validation of these results in other regions and races is needed in the future. Second, this was a retrospective cross-sectional study performed at two centers, and the number of enrolled patients was small. In the future, studies conducted at multiple centers with larger cohorts and longer observation periods are required. Third, AST level and neutrophil count can be affected by a diverse range of factors. Our enrolled patients did not have a highly elevated AST level (>5 ULN) or an abnormal neutrophil count; hence, further studies are needed to verify whether this model is valid for patients with a significantly abnormal AST level or neutrophil count. Fourth, not all of the enrolled breast cancer patients underwent testing for novel biomarkers. Hence, we only compared the C-index and AUROCs between our model and other models, but
there was no direct comparison of other models on our data set.

| CONCLUSIONS
This study developed and validated a model to predict metastasis in BC patients from China. The predictive parameters were selected from routinely used data in real-time clinical practice without adding medical expense. This machine learning-based model can predict metastasis in BC patients accurately, with good discrimination and calibration. By using this model, clinicians can provide precise and efficient individualized therapy for patients with BC and thereby improve the prognosis of breast cancer.